{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "34589b64-1a75-49b1-8e5e-1ea3cd857111",
   "metadata": {},
   "source": [
    "## Lesson 3: Visualizing Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88adc7f3-52b6-432f-80c4-995605b556eb",
   "metadata": {},
   "source": [
    "#### Project environment setup\n",
    "\n",
    "- Load credentials and relevant Python Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aeeebfae-c08f-4c9d-a714-9090a04fbc5c",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "from utils import authenticate\n",
    "credentials, PROJECT_ID = authenticate() #Get credentials and project ID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b56c0f4-cfc3-4f6a-a189-f30f08cb585e",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "REGION = 'us-central1'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ba3a1c7-2d3a-41de-91b9-d3780801742b",
   "metadata": {},
   "source": [
    "#### Enter project details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3519fb1-7ba3-481b-ac5d-c45cfa6b27a0",
   "metadata": {
    "height": 114
   },
   "outputs": [],
   "source": [
    "# Import and initialize the Vertex AI Python SDK\n",
    "\n",
    "import vertexai\n",
    "vertexai.init(project=PROJECT_ID, \n",
    "              location=REGION, \n",
    "              credentials = credentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dea550bf-d3fb-4ae2-80da-4f3d9179a9b9",
   "metadata": {},
   "source": [
    "## Embeddings capture meaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "763e2623-ca37-43ed-90d5-d0e52cc6fba5",
   "metadata": {
    "height": 335
   },
   "outputs": [],
   "source": [
    "in_1 = \"Missing flamingo discovered at swimming pool\"\n",
    "\n",
    "in_2 = \"Sea otter spotted on surfboard by beach\"\n",
    "\n",
    "in_3 = \"Baby panda enjoys boat ride\"\n",
    "\n",
    "\n",
    "in_4 = \"Breakfast themed food truck beloved by all!\"\n",
    "\n",
    "in_5 = \"New curry restaurant aims to please!\"\n",
    "\n",
    "\n",
    "in_6 = \"Python developers are wonderful people\"\n",
    "\n",
    "in_7 = \"TypeScript, C++ or Java? All are great!\" \n",
    "\n",
    "\n",
    "input_text_lst_news = [in_1, in_2, in_3, in_4, in_5, in_6, in_7]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "726ce640-0505-4d01-9620-94502af9eb70",
   "metadata": {
    "height": 97
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from vertexai.language_models import TextEmbeddingModel\n",
    "\n",
    "embedding_model = TextEmbeddingModel.from_pretrained(\n",
    "    \"textembedding-gecko@001\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b270e703-62fd-47cf-893e-3fe7a1282f29",
   "metadata": {},
   "source": [
    "- Get embeddings for all pieces of text.\n",
    "- Store them in a 2D NumPy array (one row for each embedding)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30dbf38d-5f88-4ee0-804c-87099b5d6693",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "embeddings = []\n",
    "for input_text in input_text_lst_news:\n",
    "    emb = embedding_model.get_embeddings(\n",
    "        [input_text])[0].values\n",
    "    embeddings.append(emb)\n",
    "    \n",
    "embeddings_array = np.array(embeddings) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9129943c-d624-4ada-9dbf-8277fc12554d",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "print(\"Shape: \" + str(embeddings_array.shape))\n",
    "print(embeddings_array)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deffaf3d-f57e-4b14-a391-620194c5d354",
   "metadata": {},
   "source": [
    "#### Reduce embeddings from 768 to 2 dimensions for visualization\n",
    "- We'll use principal component analysis (PCA).\n",
    "- You can learn more about PCA in [this video](https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning/lecture/73zWO/reducing-the-number-of-features-optional) from the Machine Learning Specialization. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8fa6ec2-fa3b-47da-831a-1892dd8560dc",
   "metadata": {
    "height": 114
   },
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# Perform PCA for 2D visualization\n",
    "PCA_model = PCA(n_components = 2)\n",
    "PCA_model.fit(embeddings_array)\n",
    "new_values = PCA_model.transform(embeddings_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8743daf4-31f3-467d-be38-fc5984857682",
   "metadata": {
    "height": 46
   },
   "outputs": [],
   "source": [
    "print(\"Shape: \" + str(new_values.shape))\n",
    "print(new_values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9328db18-bb41-4405-8139-b62f69850a2a",
   "metadata": {
    "height": 114
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import mplcursors\n",
    "%matplotlib ipympl\n",
    "\n",
    "from utils import plot_2D\n",
    "plot_2D(new_values[:,0], new_values[:,1], input_text_lst_news)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20c789c6-7f12-41ab-96a0-15aea8082a1a",
   "metadata": {},
   "source": [
    "#### Embeddings and Similarity\n",
    "- Plot a heat map to compare the embeddings of sentences that are similar and sentences that are dissimilar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99dcb75a-abd7-4a45-a78b-3800fb4cf559",
   "metadata": {
    "height": 233
   },
   "outputs": [],
   "source": [
    "in_1 = \"\"\"He couldn’t desert \n",
    "          his post at the power plant.\"\"\"\n",
    "\n",
    "in_2 = \"\"\"The power plant needed \n",
    "          him at the time.\"\"\"\n",
    "\n",
    "in_3 = \"\"\"Cacti are able to \n",
    "          withstand dry environments.\"\"\" \n",
    "\n",
    "in_4 = \"\"\"Desert plants can \n",
    "          survive droughts.\"\"\" \n",
    "\n",
    "input_text_lst_sim = [in_1, in_2, in_3, in_4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f86c0c3d-87d9-4601-bc03-dd998bf12166",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "embeddings = []\n",
    "for input_text in input_text_lst_sim:\n",
    "    emb = embedding_model.get_embeddings([input_text])[0].values\n",
    "    embeddings.append(emb)\n",
    "    \n",
    "embeddings_array = np.array(embeddings) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5057cff-18dc-42a8-918a-86e54c2e7c0f",
   "metadata": {
    "height": 131
   },
   "outputs": [],
   "source": [
    "from utils import plot_heatmap\n",
    "\n",
    "y_labels = input_text_lst_sim\n",
    "\n",
    "# Plot the heatmap\n",
    "plot_heatmap(embeddings_array, y_labels = y_labels, title = \"Embeddings Heatmap\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b984ee27-fa85-4839-b799-748b9d82107e",
   "metadata": {},
   "source": [
    "Note: the heat map won't show everything because there are 768 columns to show.  To adjust the heat map with your mouse:\n",
    "- Hover your mouse over the heat map.  Buttons will appear on the left of the heatmap.  Click on the button that has a vertical and horizontal double arrow (they look like axes).\n",
    "- Left click and drag to move the heat map left and right.\n",
    "- Right click and drag up to zoom in.\n",
    "- Right click and drag down to zoom out."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e075dc41-18e8-40e1-8435-d0f972d0c8ab",
   "metadata": {},
   "source": [
    "#### Compute cosine similarity\n",
    "- The `cosine_similarity` function expects a 2D array, which is why we'll wrap each embedding list inside another list.\n",
    "- You can verify that sentence 1 and 2 have a higher similarity compared to sentence 1 and 4, even though sentence 1 and 4 both have the words \"desert\" and \"plant\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0dff11a0-ae2b-4898-b565-3be7439a27b0",
   "metadata": {
    "height": 29
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e99af9-2bf6-4ba5-ad72-809a12c6fca1",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "def compare(embeddings,idx1,idx2):\n",
    "    return cosine_similarity([embeddings[idx1]],[embeddings[idx2]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8842d87c-1abf-4908-9ff7-ab8b3200532e",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "print(in_1)\n",
    "print(in_2)\n",
    "print(compare(embeddings,0,1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35860466-0090-4c86-b39a-baf6926b613c",
   "metadata": {
    "height": 63
   },
   "outputs": [],
   "source": [
    "print(in_1)\n",
    "print(in_4)\n",
    "print(compare(embeddings,0,3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3f2115e-785b-45a6-acb0-431e015420d5",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
